Your mission

Perform text analysis.

Okay, I need more information.

Perform sentiment analysis or topic modeling using text analysis methods as demonstrated in the pre-class work and in the readings.

Okay, I need even more information.

Do the above. Can’t think of a data source?

  • gutenbergr
  • AssociatedPress from the topicmodels package
  • NYTimes or USCongress from the RTextTools package
  • Harry Potter Complete 7 Books text:

```
if (packageVersion("devtools") < "1.6") {
  install.packages("devtools")
}

devtools::install_github("bradleyboehmke/harrypotter")
```

  • [State of the Union speeches](https://pradeepadhokshaja.wordpress.com/2017/03/31/scraping-the-web-for-presdential-inaugural-addresses-using-rvest/)
  • Scrape tweets using [`twitteR`](https://www.credera.com/blog/business-intelligence/twitter-analytics-using-r-part-1-extract-tweets/)

Analyze the text for sentiment OR topic. You do not need to do both. The DataCamp courses and Tidy Text Mining with R are good starting points for templates to perform this type of analysis, but feel free to expand beyond these examples.

Timelines and Task

We will spend the next 2 weeks working on analyzing textual data in R. You will do the following:

Gather data from GitHub and store the Harry Potter Complete 7 Books text in a data frame

# sevenbook is a tidy-text data frame containing all 7 novels, one word per row.
sevenbook
## # A tibble: 409,338 x 4
##    chapter      word              title series
##      <int>     <chr>              <chr>  <int>
##  1       1       boy philosophers_stone      1
##  2       1     lived philosophers_stone      1
##  3       1   dursley philosophers_stone      1
##  4       1    privet philosophers_stone      1
##  5       1     drive philosophers_stone      1
##  6       1     proud philosophers_stone      1
##  7       1 perfectly philosophers_stone      1
##  8       1    normal philosophers_stone      1
##  9       1    people philosophers_stone      1
## 10       1    expect philosophers_stone      1
## # ... with 409,328 more rows
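
The code that builds `sevenbook` is not shown, so the following is only a minimal sketch of how a one-word-per-row table of this shape could be produced with `tidytext`. The two-sentence `toy_book` vector and the `toy_tidy` name are made up for illustration; the real input would be the chapter texts from the harrypotter package, and removing stop words is an assumption based on the output above.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for one novel: one character string per chapter.
toy_book <- c(
  "The boy who lived was proud and perfectly normal.",
  "People did not expect the boy to be normal."
)

toy_tidy <- tibble(chapter = seq_along(toy_book), text = toy_book) %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop common stop words (assumed step)
  mutate(title = "toy_book", series = 1L)

toy_tidy
```

Repeating this for each novel and binding the results with `bind_rows()` would give a single table with the same four columns as `sevenbook`.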

1. Common words Analysis

** 1.1 What are the top words in each book? **

# Top 10 words in each novel
top_words
## # A tibble: 70 x 3
## # Groups:   title [7]
##                 title      word     n
##                 <chr>     <chr> <int>
##  1 chamber_of_secrets     harry  1503
##  2 chamber_of_secrets       ron   650
##  3 chamber_of_secrets  hermione   289
##  4 chamber_of_secrets    malfoy   202
##  5 chamber_of_secrets  lockhart   197
##  6 chamber_of_secrets professor   190
##  7 chamber_of_secrets   weasley   157
##  8 chamber_of_secrets    looked   155
##  9 chamber_of_secrets      time   148
## 10 chamber_of_secrets      eyes   145
## # ... with 60 more rows
# Plot the bar chart of top words
graph_top

# From the bar charts, we find that the main characters are Harry, Ron and Hermione.
# The most common words are usually related to characters, such as "Dumbledore", "Hagrid", "Snape", "Uncle" and "Professor"...
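
A table like `top_words` can be sketched as a count per title followed by a per-group top-N selection. This is an illustrative reconstruction on made-up data (`toy_words` and `top_words_toy` are hypothetical names), not the original code.

```r
library(dplyr)

# Made-up word occurrences for two hypothetical books.
toy_words <- tibble(
  title = c(rep("book_a", 6), rep("book_b", 3)),
  word  = c("harry", "harry", "harry", "ron", "ron", "time",
            "harry", "harry", "eyes")
)

top_words_toy <- toy_words %>%
  count(title, word, sort = TRUE) %>%          # frequency of each word per title
  group_by(title) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%   # keep the 2 most frequent per title
  ungroup()

top_words_toy
```

With the real data, `n = 10` in `slice_max()` would give the 70 rows (7 titles × 10 words) shown above.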

** 1.2 What are common words in the novels after removing characters’ names? **

# After removing some common words related to the characters ("harry","harry's","potter","ron","hermione","dumbledore","snape","hagrid","weasley","voldemort","Malfoy","professor"), plot the top 10 words in each novel

no_char_graph

# The bar charts show that "looked", "eyes", "time", "voice", "head"... (often words related to the body) are among the top words.

2. Character analysis: How do the proportions of the three main characters change across the novels and chapters? How do the proportions of other characters change across the novels?

# Calculate the proportion of each word in each novel
words_prop
## # A tibble: 63,651 x 5
##                   title series     word     n proportion
##                   <chr>  <int>    <chr> <int>      <dbl>
##  1 order_of_the_phoenix      5    harry  3730 0.03854222
##  2       goblet_of_fire      4    harry  2936 0.04040571
##  3      deathly_hallows      7    harry  2770 0.03773533
##  4    half_blood_prince      6    harry  2581 0.04090462
##  5  prisoner_of_azkaban      3    harry  1824 0.04428474
##  6   chamber_of_secrets      2    harry  1503 0.04470420
##  7 order_of_the_phoenix      5 hermione  1220 0.01260630
##  8   philosophers_stone      1    harry  1213 0.04243484
##  9 order_of_the_phoenix      5      ron  1189 0.01228598
## 10      deathly_hallows      7 hermione  1077 0.01467183
## # ... with 63,641 more rows
# Calculate each word's proportion by chapter in each novel
words_prop_chapter
## # A tibble: 215,433 x 6
##                   title chapter series  word     n proportion
##                   <chr>   <int>  <int> <chr> <int>      <dbl>
##  1   chamber_of_secrets      19      2 harry   173 0.05158020
##  2       goblet_of_fire      31      4 harry   161 0.05439189
##  3       goblet_of_fire      26      4 harry   159 0.05033238
##  4  prisoner_of_azkaban      21      3 harry   153 0.05694083
##  5       goblet_of_fire      28      4 harry   152 0.05175349
##  6 order_of_the_phoenix      35      5 harry   149 0.04623022
##  7 order_of_the_phoenix      24      5 harry   147 0.04766537
##  8    half_blood_prince      18      6 harry   145 0.05228994
##  9       goblet_of_fire      20      4 harry   144 0.05801773
## 10       goblet_of_fire      23      4 harry   144 0.04551201
## # ... with 215,423 more rows
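
The `proportion` column appears to be each word's count divided by the total number of words in the same novel (or chapter). A minimal sketch under that assumption, on made-up data with the hypothetical names `toy_words` and `toy_prop`:

```r
library(dplyr)

# Made-up word occurrences for two hypothetical books.
toy_words <- tibble(
  title = c(rep("book_a", 4), rep("book_b", 5)),
  word  = c("harry", "harry", "ron", "time",
            "harry", "hermione", "hermione", "ron", "eyes")
)

toy_prop <- toy_words %>%
  count(title, word) %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%  # share of all words within the same title
  ungroup() %>%
  arrange(desc(n))

toy_prop
```

Adding `chapter` to both `count()` and `group_by()` would give the per-chapter version.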

** 2.1 How do the proportions of the three main characters change across the novels? **

# Plot the proportions of the three main characters in each book. 

prop_book_graph

## The proportions of Harry and Ron decrease slightly over the series, while the proportion of Hermione increases slightly. In the second book (chamber_of_secrets) there is a relatively big gap between the proportions of Ron and Hermione.

** 2.2 How do the proportions of the three main characters change across the chapters in each book? **

# Draw line plots of each novel to compare the proportion change 

prop_chapter_graph

## Fans of Ron or Hermione can find the chapters in which the character has a relatively high proportion. For example, in the first book (philosophers_stone), Ron and Hermione first appear in the 6th chapter.

** 2.3 How do the proportions of other characters change across the novels? **

other_prop

# The line plot shows that the proportion of Hagrid goes down over the series. Overall, the proportion of Dumbledore goes up from book 1 to book 6 and drops in the 7th novel.

3. Sentiment analysis

** 3.1 What are common joy words and sad words in the seven novels? **

# Extract joy words from the NRC sentiment lexicon.

nrcjoy
## # A tibble: 689 x 2
##             word sentiment
##            <chr>     <chr>
##  1    absolution       joy
##  2     abundance       joy
##  3      abundant       joy
##  4      accolade       joy
##  5 accompaniment       joy
##  6    accomplish       joy
##  7  accomplished       joy
##  8       achieve       joy
##  9   achievement       joy
## 10       acrobat       joy
## # ... with 679 more rows
# Use inner_join to perform the sentiment analysis.

joy
## # A tibble: 1,713 x 3
##                   title  joyword     n
##                   <chr>    <chr> <int>
##  1 order_of_the_phoenix ministry   191
##  2 order_of_the_phoenix    found   164
##  3 order_of_the_phoenix  feeling   145
##  4       goblet_of_fire  magical   129
##  5       goblet_of_fire ministry   115
##  6    half_blood_prince ministry   113
##  7       goblet_of_fire    found   108
##  8      deathly_hallows ministry    96
##  9    half_blood_prince    found    91
## 10      deathly_hallows    found    87
## # ... with 1,703 more rows
# In every novel, "found" is a common joy word. "magical", "hope", "smile"... are also frequently used joy words across the seven books.

joy_graph

# Extract sad words from the NRC sentiment lexicon.

nrcsad
## # A tibble: 1,191 x 2
##           word sentiment
##          <chr>     <chr>
##  1     abandon   sadness
##  2   abandoned   sadness
##  3 abandonment   sadness
##  4   abduction   sadness
##  5    abortion   sadness
##  6    abortive   sadness
##  7     abscess   sadness
##  8     absence   sadness
##  9      absent   sadness
## 10    absentee   sadness
## # ... with 1,181 more rows
# Use inner_join to perform the sentiment analysis.
sad
## # A tibble: 2,559 x 3
##                   title sadword     n
##                   <chr>   <chr> <int>
##  1 order_of_the_phoenix   harry  3730
##  2       goblet_of_fire   harry  2936
##  3      deathly_hallows   harry  2770
##  4    half_blood_prince   harry  2581
##  5  prisoner_of_azkaban   harry  1824
##  6   chamber_of_secrets   harry  1503
##  7   philosophers_stone   harry  1213
##  8  prisoner_of_azkaban   black   332
##  9       goblet_of_fire   moody   309
## 10      deathly_hallows   death   305
## # ... with 2,549 more rows
# In every novel, "black" and "dark" are common sad words. "kill", "bad", "leave", "death"... are also frequently used sad words across the seven books. The NRC results contain something weird, though: "mother" is in both the joy and the sad word lists.

sad_graph

# Check the word "mother" in the NRC lexicon. It is tagged with several different sentiments, so we should not take "mother" into account when analyzing sentiment here.

get_sentiments("nrc")%>%filter(word=="mother")
## # A tibble: 6 x 2
##     word    sentiment
##    <chr>        <chr>
## 1 mother anticipation
## 2 mother          joy
## 3 mother     negative
## 4 mother     positive
## 5 mother      sadness
## 6 mother        trust
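
One way to act on this is to find the words the lexicon tags with both joy and sadness and drop them before joining. The sketch below uses a tiny made-up lexicon (`toy_lexicon`) in the same `word`/`sentiment` shape as NRC, so it runs without downloading the real lexicon. The names `toy_text`, `ambiguous`, and `joy_counts` are hypothetical.

```r
library(dplyr)

# Hypothetical mini-lexicon shaped like get_sentiments("nrc").
toy_lexicon <- tibble(
  word      = c("mother", "mother", "smile", "death", "hope"),
  sentiment = c("joy", "sadness", "joy", "sadness", "joy")
)

# Made-up tidy text: one word per row.
toy_text <- tibble(word = c("mother", "smile", "smile", "death", "hope", "wand"))

# Words tagged with both joy and sadness are treated as ambiguous.
ambiguous <- toy_lexicon %>%
  filter(sentiment %in% c("joy", "sadness")) %>%
  count(word, name = "n_tags") %>%
  filter(n_tags > 1) %>%
  pull(word)

# Count joy words, skipping the ambiguous ones.
joy_counts <- toy_text %>%
  filter(!word %in% ambiguous) %>%
  inner_join(filter(toy_lexicon, sentiment == "joy"), by = "word") %>%
  count(word, sort = TRUE)

joy_counts
```

Swapping `toy_lexicon` for `get_sentiments("nrc")` and `toy_text` for `sevenbook` would apply the same idea to the novels.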

** 3.2 How does the sentiment change across the novels and chapters? Does it become more positive or negative? **

# 3.2.1 Compare the ratio of negative to positive words used in the seven books. A larger ratio indicates more negative sentiment.

ratio_np

# The line graph shows that the ratio of negative to positive words fluctuates: a high ratio is usually followed by a relatively low ratio in the next book, except that the ratio of "prisoner_of_azkaban" is higher than that of "chamber_of_secrets".
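
The ratio itself can be sketched as follows: count negative and positive words per title, spread the two counts into columns, and divide. The data and the names `toy_scored` and `toy_ratio` are made up; with the real data the `sentiment` column would come from joining the words to a lexicon that has positive and negative labels.

```r
library(dplyr)
library(tidyr)

# Made-up words that have already been labelled positive or negative.
toy_scored <- tibble(
  title     = c("book_a", "book_a", "book_a",
                "book_b", "book_b", "book_b", "book_b"),
  sentiment = c("negative", "negative", "positive",
                "negative", "positive", "positive", "positive")
)

toy_ratio <- toy_scored %>%
  count(title, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(ratio = negative / positive)  # larger ratio = more negative

toy_ratio
```

Adding `chapter` to the `count()` call gives the per-chapter ratio.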

# 3.2.2 How does the ratio change through chapters in each book?

ratio_chapter_np

# The line graphs of each book show that at the end of the story, the ratio of negative to positive words declines to a lower level, which means the story has a relatively "happy ending". The fluctuations within each book also show the ups and downs of the sentiment. For example, in half_blood_prince, there is a peak of negative sentiment in chapter 29.

4. Examine how sentiment changes throughout each novel and chapter, section by section

# Create a tidy text format that records the line number of each word.

series
## # A tibble: 409,485 x 5
##    chapter linenumber      word              title series
##      <int>      <int>     <chr>              <chr>  <int>
##  1       1          1       boy philosophers_stone      1
##  2       1          1     lived philosophers_stone      1
##  3       1          2   dursley philosophers_stone      1
##  4       1          2    privet philosophers_stone      1
##  5       1          2     drive philosophers_stone      1
##  6       1          2     proud philosophers_stone      1
##  7       1          2 perfectly philosophers_stone      1
##  8       1          2    normal philosophers_stone      1
##  9       1          3    people philosophers_stone      1
## 10       1          3    expect philosophers_stone      1
## # ... with 409,475 more rows
# Use the Bing lexicon to analyze how sentiment changes from section to section. Here sentiment = positive - negative.

series_bing

## In most sections, negative words outnumber positive words.


# Use the AFINN lexicon to analyze how sentiment changes from section to section. Here sentiment = sum(score).

series_afinn

## The AFINN results seem more reasonable, because the AFINN lexicon assigns each word a numeric score rather than only a positive or negative label.

# Take philosophers_stone as an example to examine how sentiment changes within each chapter, using the Bing lexicon

sentence_sent
## # A tibble: 141 x 5
##    chapter index negative positive sentiment
##      <int> <dbl>    <dbl>    <dbl>     <dbl>
##  1       1     0       16       13        -3
##  2       1     1       26       14       -12
##  3       1     2       11        9        -2
##  4       1     3       12        4        -8
##  5       1     4       15       13        -2
##  6       1     5       22       17        -5
##  7       1     6       14       20         6
##  8       1     7       14       20         6
##  9       2     0       16       12        -4
## 10       2     1       19       17        -2
## # ... with 131 more rows
stone_graph

# We can see which chapters have more sections with positive sentiment, such as chapters 5, 6, and 7.

5. Use a word cloud to find the most common words in Harry Potter
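
The section-level table can be sketched by integer-dividing the line number to get a section index and then computing positive minus negative per section. The section size of 2 lines, the mini-lexicon `toy_bing`, and the data are all made up for illustration; the section size used in the original analysis is not stated.

```r
library(dplyr)
library(tidyr)

# Hypothetical mini-lexicon shaped like get_sentiments("bing").
toy_bing <- tibble(
  word      = c("happy", "love", "dark", "death"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Made-up tidy text with a line number for every word.
toy_series <- tibble(
  chapter    = 1L,
  linenumber = c(0L, 0L, 1L, 2L, 3L, 3L),
  word       = c("happy", "dark", "love", "death", "dark", "wand")
)

section_size <- 2L  # assumed number of lines per section

toy_sections <- toy_series %>%
  inner_join(toy_bing, by = "word") %>%
  count(chapter, index = linenumber %/% section_size, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_sections
```

For an AFINN-style lexicon the `count()` and `pivot_wider()` steps would be replaced by grouping on `chapter` and `index` and summing the numeric score.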

library(dplyr)      # %>% and count()
library(wordcloud)  # wordcloud() and comparison.cloud()

sevenbook %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

# Across the seven books, the word cloud also shows that the main characters are "Harry", "Ron", "Hermione", "Dumbledore" and "Hagrid"... Some common words are "looked", "time", "magic", "eyes"...

** Find the most common positive and negative words **

library(tidytext)  # get_sentiments()
library(reshape2)  # acast()

sevenbook %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 50)
## Joining, by = "word"

# From the word cloud, we find that the most common positive words throughout the series are "magic", "top", "happy", "gold", "love", "nice"... And the most common negative words are "dark", "fell", "hard", "death"...

6. How are words related in Harry Potter? Create bigrams and analyze the relationships between words.

# Examine the most common bigrams

bigram_n
## # A tibble: 523,420 x 3
## # Groups:   title [7]
##                   title     bigram     n
##                   <chr>      <chr> <int>
##  1 order_of_the_phoenix     of the  1192
##  2      deathly_hallows     of the  1002
##  3       goblet_of_fire     of the   901
##  4 order_of_the_phoenix     in the   872
##  5    half_blood_prince     of the   707
##  6 order_of_the_phoenix said harry   689
##  7      deathly_hallows     in the   673
##  8       goblet_of_fire     in the   673
##  9 order_of_the_phoenix     at the   607
## 10 order_of_the_phoenix     on the   603
## # ... with 523,410 more rows
# The most common bigrams, such as "of the" and "in the", are not interesting, because most of them consist of stop words.
# Remove bigrams where either word is a stop word
# New bigram counts

bigram_counts
## # A tibble: 89,120 x 3
##           word1      word2     n
##           <chr>      <chr> <int>
##  1    professor mcgonagall   578
##  2        uncle     vernon   386
##  3        harry     potter   349
##  4        death     eaters   346
##  5        harry     looked   316
##  6        harry        ron   302
##  7         aunt    petunia   206
##  8 invisibility      cloak   192
##  9    professor  trelawney   177
## 10         dark       arts   176
## # ... with 89,110 more rows
# We can see that names are the most common pairs in Harry Potter series. 
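
The bigram steps described above (tokenize into word pairs, split each pair, drop pairs containing a stop word, count) can be sketched like this. The two-sentence `toy_text` and the name `toy_bigram_counts` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up text standing in for a novel.
toy_text <- tibble(
  title = "toy_book",
  text  = "Professor McGonagall looked at Harry Potter. Uncle Vernon looked at Harry Potter."
)

toy_bigram_counts <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # consecutive word pairs
  separate(bigram, c("word1", "word2"), sep = " ") %>%      # split into two columns
  filter(!word1 %in% stop_words$word,                       # drop pairs with a stop word
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

toy_bigram_counts
```

`tidyr::unite(bigram, word1, word2, sep = " ")` would join the two columns back into a single bigram column for per-title counts.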


# The table shows the number of occurrences of each ordered pair of the characters "Harry", "Ron" and "Hermione"

character_relationship
## # A tibble: 6 x 4
##      word1    word2     n  rank
##      <chr>    <chr> <int> <int>
## 1    harry      ron   302     6
## 2      ron hermione    84    33
## 3    harry hermione    59    63
## 4      ron    harry    54    71
## 5 hermione    harry    35   143
## 6 hermione      ron    23   249
# Harry and Ron usually appear together.
# Ron and Hermione also often appear together.
# Unite and analyze

bigrams_united
## # A tibble: 107,016 x 4
##                 title series               bigram     n
##                 <chr>  <int>                <chr> <int>
##  1 philosophers_stone      1         uncle vernon    97
##  2 philosophers_stone      1 professor mcgonagall    90
##  3 philosophers_stone      1         aunt petunia    52
##  4 philosophers_stone      1         harry potter    26
##  5 philosophers_stone      1         harry looked    22
##  6 philosophers_stone      1 professor dumbledore    20
##  7 philosophers_stone      1   professor quirrell    18
##  8 philosophers_stone      1     hermione granger    16
##  9 philosophers_stone      1         privet drive    16
## 10 philosophers_stone      1   professor flitwick    15
## # ... with 107,006 more rows
# Professor McGonagall is a common character in Harry Potter. From the plot, we find that the frequency of her name goes up in the book "Order of the Phoenix".

united_graph

# We find that in goblet_of_fire, Harry and Ron usually appear together. 

bigram_harry
## # A tibble: 8,566 x 4
##                   title series       bigram     n
##                   <chr>  <int>        <chr> <int>
##  1       goblet_of_fire      4    harry ron    86
##  2 order_of_the_phoenix      5 harry looked    76
##  3      deathly_hallows      7 harry looked    60
##  4       goblet_of_fire      4 harry looked    58
##  5 order_of_the_phoenix      5    harry ron    54
##  6  prisoner_of_azkaban      3    harry ron    49
##  7      deathly_hallows      7    harry ron    40
##  8    half_blood_prince      6 harry looked    36
##  9    half_blood_prince      6    harry ron    34
## 10   chamber_of_secrets      2 harry looked    33
## # ... with 8,556 more rows
# Analyze sentiment associated with Harry with "AFINN"

harry_sentiment 
## # A tibble: 447 x 3
##         word score     n
##        <chr> <int> <int>
##  1      yeah     1    47
##  2   reached     1    29
##  3      dear     2    26
##  4      lied    -2    24
##  5   laughed     1    22
##  6   feeling     1    21
##  7  bitterly    -2    19
##  8      fire    -2    19
##  9   stopped    -1    18
## 10 nervously    -2    17
## # ... with 437 more rows
# The figure shows the common positive and negative sentiment words associated with "Harry".

harry_graph

** Network of bigrams **

# Keep only relatively common combinations (bigrams that occur more than 60 times)

bigram_graph
## IGRAPH 3711453 DN-- 85 60 -- 
## + attr: name (v/c), n (e/n)
## + edges from 3711453 (vertex names):
##  [1] professor   ->mcgonagall uncle       ->vernon    
##  [3] harry       ->potter     death       ->eaters    
##  [5] harry       ->looked     harry       ->ron       
##  [7] aunt        ->petunia    invisibility->cloak     
##  [9] professor   ->trelawney  dark        ->arts      
## [11] professor   ->umbridge   death       ->eater     
## [13] entrance    ->hall       madam       ->pomfrey   
## [15] dark        ->lord       professor   ->dumbledore
## + ... omitted several edges
# The figure visualizes the relational tidy data of the Harry Potter series. It corresponds to the bigram_counts table: "Professor McGonagall", "Uncle Vernon"... are common combinations.

network
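
A directed graph like `bigram_graph` can be sketched by filtering the bigram counts and passing them to `igraph::graph_from_data_frame()`, which treats the first two columns as edge endpoints and keeps the remaining column as an edge attribute. The counts in `toy_counts` are made up, and the threshold of 60 follows the comment above.

```r
library(dplyr)
library(igraph)

# Made-up bigram counts (the n values are illustrative, not from the novels).
toy_counts <- tibble(
  word1 = c("professor", "uncle", "harry", "dark"),
  word2 = c("mcgonagall", "vernon", "potter", "arts"),
  n     = c(500L, 300L, 200L, 40L)
)

toy_graph <- toy_counts %>%
  filter(n > 60) %>%         # keep only relatively common bigrams
  graph_from_data_frame()    # word1 -> word2 edges, n stored as an edge attribute

toy_graph
```

The resulting object could then be handed to a plotting package such as ggraph to draw the network.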